fix(rdf): converge Fuseki state on weekly rebuilds and isolate API latency #28117
RdfIndexApp ran daily and never reconciled removed relationships, so triples
grew unboundedly across runs. When Fuseki crash-looped on the resulting disk
pressure, every entity-write hook blocked synchronously on the unreachable
server (no HTTP connect timeout, 3-retry loop on ConnectException), saturating
the bounded AsyncService pool and pushing login to ~45s.
Storage-side fixes (stop growth):
- Drop the extractRelationshipTriples "preserve forward" path in
RdfRepository.createOrUpdate; the translator is the source of truth and the
surrounding orchestration already rewrites the current relationship set.
This also removes a wasted CONSTRUCT round-trip per entity write.
- bulkStoreRelationships now does per-source-entity DELETE WHERE with a
predicate-exclusion FILTER for lineage edges, so relationships that no
longer exist actually leave the store.
- Wire RdfRepository.clearAllGlossaryTermRelations() into RdfIndexApp's
initializeJob (the method existed but had no callers).
- Flip recreateIndex default to true and move the cron to Saturday midnight
("0 0 * * 6"). Add reloadOntologies() so CLEAR ALL doesn't leave the
ontology graph empty before indexing starts.
- Include a 2.0.1 post-data migration that updates existing installed_apps
rows; the app loader is insert-only on upgrade.
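A minimal sketch of the per-source reconciliation delete described in the list above. The `buildDelete` helper, the class name, and the example URIs are hypothetical and not taken from the PR. SPARQL's `DELETE WHERE` shorthand does not accept a FILTER, so the sketch uses the full `DELETE { } WHERE { }` form.

```java
import java.util.List;
import java.util.stream.Collectors;

class PerSourceDeleteSketch {

  // Builds a SPARQL 1.1 Update that removes the outgoing IRI-valued triples of
  // one source entity, skipping predicates in the exclusion list (for example
  // lineage edges that another code path manages).
  // Assumption: all inputs are trusted, well-formed absolute IRIs.
  public static String buildDelete(
      String graphUri, String sourceUri, List<String> excludedPredicates) {
    String notIn = excludedPredicates.stream()
        .map(p -> "<" + p + ">")
        .collect(Collectors.joining(", "));
    String filter = excludedPredicates.isEmpty()
        ? "FILTER(isIRI(?o))"
        : "FILTER(isIRI(?o) && ?p NOT IN (" + notIn + "))";
    return "WITH <" + graphUri + ">\n"
        + "DELETE { <" + sourceUri + "> ?p ?o }\n"
        + "WHERE  { <" + sourceUri + "> ?p ?o . " + filter + " }";
  }

  public static void main(String[] args) {
    // Hypothetical graph and entity URIs, for illustration only.
    System.out.println(buildDelete(
        "https://example.org/graph/knowledge",
        "https://example.org/entity/table/1234",
        List.of("http://www.w3.org/ns/prov#wasDerivedFrom")));
  }
}
```

Because the sketch matches every IRI-valued object, anything that must survive the delete (such as `rdf:type`) would have to be in the exclusion list; a real implementation may instead scope the delete to a known set of relationship predicates.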
Connectivity / concurrency fixes (isolate API latency from Fuseki health):
- Add 2s connectTimeout to every JenaFusekiStorage HttpClient and fast-fail
on ConnectException / ClosedChannelException / HttpConnectTimeoutException
instead of retrying. Introduce a 5-failure/30s circuit breaker.
- Route all RdfUpdater mutators through AsyncService.execute with a bounded
pendingWrites gate (cap 1000, drop-on-overflow with logged warning) so a
dead Fuseki can no longer block request threads or starve the AsyncService
pool.
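The bounded gate in the last item could look roughly like the sketch below. `PendingWritesGate` and its methods are hypothetical names, a plain `ExecutorService` stands in for `AsyncService`, and the warning goes to stderr instead of a logger; only the cap and the drop-on-overflow behaviour come from the description above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

class PendingWritesGate {
  private final int cap;
  private final ExecutorService executor;
  private final AtomicInteger pending = new AtomicInteger();
  private final AtomicLong dropped = new AtomicLong();

  public PendingWritesGate(int cap, ExecutorService executor) {
    this.cap = cap;
    this.executor = executor;
  }

  // Returns true if the write was queued, false if it was dropped.
  // The caller never waits on the write itself.
  public boolean submit(Runnable write) {
    if (pending.incrementAndGet() > cap) {
      pending.decrementAndGet();
      dropped.incrementAndGet();
      System.err.println("WARN: RDF write dropped, pendingWrites at cap " + cap);
      return false;
    }
    try {
      executor.execute(() -> {
        try {
          write.run();
        } catch (RuntimeException e) {
          System.err.println("WARN: RDF write failed: " + e);
        } finally {
          pending.decrementAndGet(); // free the slot whether the write worked or not
        }
      });
      return true;
    } catch (RejectedExecutionException e) {
      pending.decrementAndGet();
      dropped.incrementAndGet();
      return false;
    }
  }

  public int pending() { return pending.get(); }

  public long dropped() { return dropped.get(); }

  public static void main(String[] args) {
    ExecutorService worker = Executors.newSingleThreadExecutor();
    CountDownLatch fusekiIsBack = new CountDownLatch(1);
    PendingWritesGate gate = new PendingWritesGate(2, worker); // tiny cap for the demo
    Runnable stalledWrite = () -> {
      try {
        fusekiIsBack.await(); // simulates a write stuck on an unreachable server
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };
    for (int i = 0; i < 3; i++) {
      System.out.println("accepted=" + gate.submit(stalledWrite));
    }
    fusekiIsBack.countDown();
    worker.shutdown();
  }
}
```

Counting before submitting keeps the check non-blocking: once the cap is reached the request thread only pays for an atomic increment and a log line, at the cost of losing that write until the next reconciliation run.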
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR aims to make RDF/Fuseki indexing converge more reliably and reduce platform latency impact when Fuseki is unhealthy.
Changes:
- Changes RDF app defaults to weekly recreate-index runs and adds migrations for existing app rows.
- Adds Fuseki connection timeout/circuit-breaker handling and async RDF updater submission.
- Adjusts RDF reindex cleanup paths, ontology reload after clear, and related unit tests.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/rdf/RdfIndexAppTest.java | Adds coverage for ontology reload and glossary relation cleanup behavior. |
| openmetadata-service/src/main/resources/json/data/appMarketPlaceDefinition/RdfIndexApp.json | Updates marketplace default recreateIndex to true. |
| openmetadata-service/src/main/resources/json/data/app/RdfIndexApp.json | Updates app default recreateIndex and weekly cron schedule. |
| openmetadata-service/src/main/java/org/openmetadata/service/rdf/storage/JenaFusekiStorage.java | Adds timeout/circuit-breaker state and relationship reconciliation changes. |
| openmetadata-service/src/main/java/org/openmetadata/service/rdf/RdfUpdater.java | Moves RDF mutating hooks to bounded async submission. |
| openmetadata-service/src/main/java/org/openmetadata/service/rdf/RdfRepository.java | Adds ontology reload and removes relationship preservation during entity writes. |
| openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/rdf/RdfIndexApp.java | Wires glossary relation cleanup and ontology reload after full RDF clear. |
| bootstrap/sql/migrations/native/2.0.1/postgres/postDataMigrationSQLScript.sql | Migrates existing PostgreSQL app rows to new RDF app defaults. |
| bootstrap/sql/migrations/native/2.0.1/mysql/postDataMigrationSQLScript.sql | Migrates existing MySQL app rows to new RDF app defaults. |
…surface ontology failures
PR #28117 review feedback. Addresses 13 findings across gitar-bot and Copilot:
Storage correctness:
- JenaFusekiStorage.storeEntity now keeps URI-valued triples (relationships) and only refreshes literal-valued triples. A metadata-only PATCH would otherwise wipe every inter-entity edge until the next weekly recreate-index, and async ordering between updateEntity and addRelationship could leave the graph missing edges (Copilot #1, #2).
- RdfRepository.removeRelationship wraps the DELETE in the knowledge named graph and uses getRelationshipPredicate so the predicate URI matches what addRelationship actually wrote (e.g. UPSTREAM → prov:wasDerivedFrom). The previous bare DELETE in the default graph was a silent no-op (Copilot #3).
- RdfBatchProcessor now calls a new RdfRepository.clearOutgoingEntityRelationships for every entity in the batch, not just those with current edges. An entity whose last outgoing relationship was removed in MySQL contributes zero RelationshipData entries, so bulkStoreRelationships' per-source DELETE never fired for it (Copilot #4).
- bulkStoreRelationships no longer swallows non-connect DELETE errors: DELETE WHERE on a source with no edges is a no-op, so exceptions there are real failures (malformed SPARQL, auth, server errors) and should surface (Copilot #5).
Visibility:
- reloadOntologies() now checks areOntologiesLoaded() after load and throws if still empty. OntologyLoader.loadOntologies catches internally, so the old reloadOntologies always appeared to succeed (Copilot #6).
- clearAllGlossaryTermRelations rethrows on failure instead of silently logging, so the indexer's caller can now react to cleanup failures (Copilot #10).
- clearAllGlossaryTermRelations pulls custom predicate URIs from GlossaryTermRelationSettings and includes them in the DELETE FILTER. The hardcoded list missed any custom predicates an admin configured (Copilot #7).
Quality:
- Set / LinkedHashSet imported instead of using java.util.* fully qualified in JenaFusekiStorage and RdfBatchProcessor (gitar-bot #2).
- RdfIndexAppTest uses InOrder to assert clearAll → reloadOntologies ordering; a plain verify would have accepted a future change that reordered the calls (Copilot #9).
- Documented the residual gap that HttpClient.connectTimeout only bounds TCP connect, not request bodies; circuit breaker + bounded pendingWrites contain the blast radius (Copilot #8).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
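The connect timeout, the fast-fail classification, and the breaker mentioned in this commit can be sketched as below. This is one possible reading, not the PR's code: `FusekiBreakerSketch` is a hypothetical name, "5-failure/30s" is assumed to mean five consecutive failures open the breaker for 30 seconds, and the clock is injected so the behaviour can be exercised without waiting.

```java
import java.net.ConnectException;
import java.net.http.HttpClient;
import java.net.http.HttpConnectTimeoutException;
import java.nio.channels.ClosedChannelException;
import java.time.Duration;
import java.util.function.LongSupplier;

class FusekiBreakerSketch {
  static final int FAILURE_THRESHOLD = 5;  // assumed: consecutive failures
  static final long OPEN_MILLIS = 30_000L; // assumed: how long the breaker stays open

  private final LongSupplier clock; // millisecond clock, injectable for tests
  private int consecutiveFailures;
  private long openedAt = -1;       // -1 means the breaker is closed

  public FusekiBreakerSketch(LongSupplier clock) {
    this.clock = clock;
  }

  // connectTimeout bounds only connection establishment; a per-request limit
  // would need HttpRequest.Builder.timeout in addition.
  public static HttpClient newClient() {
    return HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(2)).build();
  }

  // True when the failure means "server unreachable", so retrying is pointless.
  public static boolean isConnectFailure(Throwable t) {
    for (Throwable c = t; c != null; c = c.getCause()) {
      if (c instanceof ConnectException
          || c instanceof ClosedChannelException
          || c instanceof HttpConnectTimeoutException) {
        return true;
      }
    }
    return false;
  }

  public synchronized boolean allowRequest() {
    if (openedAt < 0) {
      return true;
    }
    if (clock.getAsLong() - openedAt >= OPEN_MILLIS) {
      openedAt = -1; // window elapsed, let calls through again
      consecutiveFailures = 0;
      return true;
    }
    return false;
  }

  public synchronized void recordSuccess() {
    consecutiveFailures = 0;
    openedAt = -1;
  }

  public synchronized void recordFailure() {
    if (++consecutiveFailures >= FAILURE_THRESHOLD) {
      openedAt = clock.getAsLong();
    }
  }

  public static void main(String[] args) {
    FusekiBreakerSketch breaker = new FusekiBreakerSketch(System::currentTimeMillis);
    for (int i = 0; i < FAILURE_THRESHOLD; i++) {
      breaker.recordFailure();
    }
    System.out.println("allowed after threshold: " + breaker.allowRequest());
  }
}
```

A caller would check `allowRequest()` before each Fuseki call, call `recordFailure()` when `isConnectFailure` matches, and `recordSuccess()` otherwise; a health probe that skips `allowRequest()` could then close the breaker early.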
Addressed all 13 review findings, grouped under Storage correctness, Visibility, Quality, Documented gaps, and Treated as false positive.
🟡 Playwright Results: all passed (10 flaky)
✅ 4070 passed · ❌ 0 failed · 🟡 10 flaky · ⏭️ 92 skipped
🟡 10 flaky test(s) (passed on retry)
How to debug locally: download the playwright-test-results-<shard> artifact, unzip it, then run `npx playwright show-trace path/to/trace.zip` to view the trace.
… all filtered
The two EventSubscription-skip tests used verifyNoInteractions on the RDF repository mock, which was valid before because filtered batches never touched RDF. The new per-source reconciliation clear in RdfBatchProcessor.processBatchRelationships now runs for every batch entity regardless of whether its relationships survive filtering. That is deliberate, since stale RDF state for those source entities still needs to be reconciled even when their current MySQL edges all point to excluded entity types. Switch the assertions to verify clearOutgoingEntityRelationships is the sole interaction (no bulkAdd, no addRelationship).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…se it
JenaFusekiStorage (org.openmetadata.service.rdf.storage) lives in a different package than RdfRepository (org.openmetadata.service.rdf), so the package-private buildPredicateInList helper introduced in 857c09 couldn't be called from JenaFusekiStorage.bulkStoreRelationships. CI was failing with:
[ERROR] JenaFusekiStorage.java:[606,51] buildPredicateInList(Set<String>) is not public in RdfRepository; cannot be accessed from outside package
Promote it to public alongside RELATIONSHIP_HOOK_PREDICATES (which is the only data this helper renders) so the cross-package call resolves. Local javac across the touched RDF files now reports zero new errors; the only remaining build failures are the pre-existing es.co.elastic.clients shading issues unrelated to this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… iteration
bulkStoreRelationships' early-return guard accepts sourcesToReconcile == null
as a valid input, but the subsequent per-source DELETE loop iterates
sourcesToReconcile directly — so a caller passing null with a non-empty
relationships list would skip the guard and crash at the for-loop.
Today no caller hits this path (RdfRepository.bulkAddRelationships always
passes non-null, and the 1-arg default interface method derives a set), but
the null-check in the guard explicitly encodes null as supported, so the
contract should match the iteration. Normalise once after the guard:
    Set<String> effectiveSources =
        sourcesToReconcile != null ? sourcesToReconcile : Set.of();
and use effectiveSources for both the loop and the success-log size.
Local filtered compile passes cleanly (zero NEW errors from RDF files;
remaining errors are the pre-existing es.co.elastic.clients shading mess).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lationships 2-arg signature
Three test failures after the Fix-I / atomic-clear-insert changes:
- testProcessBatchRelationshipsStoresResults verified `bulkAddRelationships(captor.capture())` (1-arg) but RdfBatchProcessor now calls the 2-arg `bulkAddRelationships(relationships, batchSources)`. Mockito surfaced this as "different arguments" because the actual call had a Set<EntitySourceRef> tail. Updated the verify to `bulkAddRelationships(captor.capture(), anySet())`.
- The two event-subscription skip tests previously verified `clearOutgoingEntityRelationships(anySet())` as the only interaction; that method is no longer called from RdfBatchProcessor (the clear was folded into bulkAddRelationships' atomic SPARQL transaction for safety). Replace with `verify(mockRdfRepository).bulkAddRelationships(eq(List.of()), anySet())`: bulkAdd is still invoked with an empty list to drive the per-source reconciliation for the batch entity, even when the only fetched relationship pointed at an excluded entity type.
Filtered local compile + test-compile passes cleanly (no NEW errors from RDF files; only pre-existing es.co.elastic.clients shading errors remain).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- collectTranslatorPredicates over-broad (3249798300): RdfRepository.addRelationship
passes storeEntity a model loaded from Fuseki PLUS the new relationship, so the
dynamic walk was pulling hook-managed predicates (om:owns, etc.) into the DELETE
scope. With async writes, two concurrent additions for the same source could
each read the old model and each storeEntity wipe the other's relationship.
Exclude RELATIONSHIP_HOOK_PREDICATES from the walk result (and defensively from
the static-set union too).
- ForkJoinPool.commonPool starvation (3249798327): runWithTimeout used
CompletableFuture.supplyAsync's default executor, so a Fuseki that stalls would
leak workers on the JVM-wide commonPool and starve unrelated CompletableFuture
/ parallel-stream work. Introduce a dedicated virtual-thread executor
(Thread.ofVirtual().name("rdf-storage-timeout-", 0)) and route all timeout
wrappers through it — virtual threads are cheap to leak and the circuit breaker
bounds the pile-up.
- Shrink-to-empty for literal predicates (3249798383): the predicate-scoped DELETE
no longer caught stale literals when a literal-valued field (description /
displayName / …) was cleared and the new model simply omitted the triple. Chain
a "DELETE … FILTER(!isIRI(?o))" pass with the URI-scoped pass so hook-managed
URI triples stay intact while stale literals get swept on every write.
- UI schema default (3249798439): the UI form schema at
utils/ApplicationSchemas/RdfIndexApp.json still declared recreateIndex.default
= false. Flipped to true to match the backend openmetadata-spec schema and the
install JSON files. (The sibling jsons/applicationSchemas/ is gitignored
generated output, no source change needed there.)
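The commonPool item above can be illustrated with a small sketch of a timeout wrapper that runs on its own executor. The names are hypothetical and the signature of the PR's `runWithTimeout` is not shown in this thread. The sketch also swaps in daemon platform threads so it runs on JDKs older than 21; on Java 21 the factory could be replaced with `Executors.newThreadPerTaskExecutor(Thread.ofVirtual().name("rdf-storage-timeout-", 0).factory())` to match the commit.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class TimeoutExecutorSketch {
  private static final AtomicInteger SEQ = new AtomicInteger();

  // Dedicated executor: calls that stall pile up here instead of occupying
  // ForkJoinPool.commonPool() workers shared with the rest of the JVM.
  static final ExecutorService TIMEOUT_EXECUTOR =
      Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "rdf-storage-timeout-" + SEQ.getAndIncrement());
        t.setDaemon(true); // a stuck call must not keep the JVM alive
        return t;
      });

  public static <T> T runWithTimeout(Supplier<T> call, long timeoutMillis)
      throws TimeoutException {
    // Passing the executor explicitly is the point: the one-argument
    // supplyAsync overload would use the common pool.
    CompletableFuture<T> future = CompletableFuture.supplyAsync(call, TIMEOUT_EXECUTOR);
    try {
      return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Marks the future cancelled; the worker thread is not interrupted and
      // keeps running until the underlying call returns.
      future.cancel(true);
      throw e;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("call failed", e.getCause());
    }
  }

  public static void main(String[] args) throws TimeoutException {
    System.out.println(runWithTimeout(() -> Thread.currentThread().getName(), 1_000));
  }
}
```

Since a timed-out worker is abandoned rather than stopped, something else has to bound how many can accumulate; in the PR that role is assigned to the circuit breaker.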
Local verification before push: spotless:apply, filtered compile + test-compile
(zero new errors), and `mvn test -Dtest='RdfIndexAppTest,RdfPropertyMapperTest,
RdfPredicatePartitionTest,RdfStorageIdempotencyTest'` — 64 tests, 0 failures.
The "buildPredicateInList package-private" finding from the same review
(3249798351) is already addressed in 03c5d4f and surfaces here only because
Copilot reviewed an earlier commit.
The "lineage incremental cleanup" finding (3249798415) is a known architectural
trade-off: addLineageWithDetails handles current lineage rows but removed edges
have no row to trigger a per-edge delete, and adding UPSTREAM/wasDerivedFrom to
RELATIONSHIP_HOOK_PREDICATES would conflict with the inline addLineageWithDetails
call that runs BEFORE bulkAddRelationships in RdfBatchProcessor. The weekly
recreateIndex=true run (the new default) wipes and rebuilds from MySQL, which
reconciles stale lineage; left this thread open as a documented gap rather
than reordering processBatchRelationships in this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code Review ✅ Approved 6 resolved / 6 findings
Implements atomic RDF reconciliation, Fuseki connection timeouts, and circuit breakers to prevent index bloat and cascading API latency. All identified findings regarding resource management and error handling have been resolved.
✅ 6 resolved
✅ Quality: Fully qualified class names used instead of imports in new code
✅ Quality: Fully qualified class names in clearAllGlossaryTermRelations
✅ Edge Case: expandPredicateCurie silently defaults null/empty to relatedTo
✅ Quality: tempModel created in loop scope is never closed
✅ Bug: Hardcoded predicate URIs in DELETE filter ignore configurable baseUri
...and 1 more resolved from earlier reviews
Describe your changes:
RDF Knowledge Graph indexing was duplicating triples and accumulating disk + memory in Fuseki on every run; when Fuseki crash-looped, every entity-write hook blocked synchronously on the unreachable server (no HTTP timeout, 3-retry loop), saturating the bounded `AsyncService` pool and pushing login latency to ~45 s. The reindex now uses `recreateIndex=true` on a weekly Saturday cadence, every reconciliation path actually deletes removed relationships, and the Fuseki client has a 2 s connect timeout + circuit breaker so a dead Fuseki can no longer block request threads.
Type of change:
High-level design:
Storage-side (stop growth):
- `RdfRepository.createOrUpdate` no longer preserves stale relationship triples; the translator is the source of truth and surrounding orchestration rewrites the current set. Also removes a wasted CONSTRUCT round-trip per write.
- `bulkStoreRelationships` does per-source-entity `DELETE WHERE` with a predicate-exclusion FILTER for lineage edges, so removed relationships actually leave the store.
- `RdfRepository.clearAllGlossaryTermRelations()` is now wired into `RdfIndexApp.initializeJob` (the method existed but had no callers).
- `recreateIndex` default flipped to `true`, cron moved to `"0 0 * * 6"` (Saturday midnight), and `reloadOntologies()` runs after `clearAll()` so the ontology graph isn't left empty.
- `2.0.1/{mysql,postgres}/postDataMigrationSQLScript.sql` to update existing `installed_apps` rows; the app loader is insert-only on upgrade.
Connectivity / concurrency (isolate platform from Fuseki health):
- `JenaFusekiStorage` HttpClients now use `connectTimeout=2s`; on `ConnectException` / `ClosedChannelException` / `HttpConnectTimeoutException` we fast-fail instead of retrying. A 5-failure/30 s circuit breaker short-circuits subsequent calls until Fuseki recovers (probed via `testConnection`, which bypasses the breaker).
- `RdfUpdater` mutators now go through `AsyncService.execute(...)` (the existing virtual-thread pool) with a bounded `pendingWrites` gate (cap 1000, drop-on-overflow with logged warning) so the request thread returns immediately and a dead Fuseki cannot starve `AsyncService` permits.
Tests:
Unit tests
Extended `RdfIndexAppTest`:
- `recreateIndex=true` test now also verifies `reloadOntologies()` is called after `clearAll()`.
- `clearAllGlossaryTermRelations()` is invoked when `glossaryTerm` is in the entity set AND `recreateIndex=false`.
- … `glossaryTerm` is absent.
Backend integration tests
- … `es.co.elastic.clients.*` shading compile issue unrelated to this work that blocks the module build. Once that is fixed, the planned end-to-end tests (re-run indexer twice → triple count unchanged; remove an edge in MySQL → triple disappears in Fuseki; point RDF endpoint at a closed port → write returns <500ms; `recreateIndex=true` → ontology graph non-empty after run) should be added.
Manual testing performed
- … `Entity.GLOSSARY_TERM` constant, AsyncService API).
- … `git stash` baseline to confirm pre-existing compile errors are unrelated to these changes.
UI screen recording / screenshots:
Not applicable.
Checklist:
🤖 Generated with Claude Code
Summary by Gitar
- … `DELETE` and `INSERT` SPARQL operations into a single atomic transaction in `bulkStoreRelationships` to prevent partial graph updates.
- … `tempModel.close()` to release in-memory graph resources during predicate URI generation.
- … `broader`, `narrower`, and `exactMatch` in `RdfRepository` to maintain cleanup integrity during `SettingsCache` outages.
- … `bulkAddRelationships` to reconcile empty-edge cases correctly while skipping processing when both relationships and sources are empty.
- … `RdfBatchProcessor` to eliminate redundant reconciliation calls and consolidate atomic updates.